fix(gax): adjust RetryAlgorithm logic for timeout v. exception#13243
fix(gax): adjust RetryAlgorithm logic for timeout v. exception#13243whowes wants to merge 5 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request modifies the RetryAlgorithm to short-circuit the retry check when an operation succeeds and the result algorithm indicates no retry is necessary. This change prevents the timing algorithm from erroneously throwing a CancellationException on successful operations. New unit tests were added to verify both the short-circuiting behavior and the standard path. Feedback suggests adding a missing assertion to one of the new test cases to explicitly verify the return value of the retry decision.
3cff2d3 to
54b9310
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request modifies the RetryAlgorithm.shouldRetry method to short-circuit when an operation has succeeded. Specifically, if the result-based retry check returns false and there is no previous exception, the method now returns false immediately without calling the timing-based retry check. This change prevents the timing-based check from erroneously throwing a CancellationException on successful operations. Corresponding unit tests were added to verify the short-circuiting behavior and ensure timing checks still occur when an error is present. I have no feedback to provide as there were no review comments to evaluate.
239df6a to
17c1ee4
Compare
|
|
| // exception, defer to any exception thrown by shouldRetryBasedOnTiming. | ||
| // This lets a timeout-related exception get propagated when justified | ||
| // rather than e.g. exceptions caused by very short RPC deadlines on a | ||
| // final polling attempt. |
There was a problem hiding this comment.
qq, I'm not sure I'm following this comment. I might not be understanding this fully.
What do you mean by defer to any exception thrown by shouldRetryBasedOnTiming`? Is this referring to an exception thrown by the ExponentialRetryAlgorithm?
IIRC, short RPC deadlines that can't be processed in time will result in a DEADLINE_EXCEEDED status code and that may or may not be retried as a retryable status code
There was a problem hiding this comment.
"Is this referring to an exception thrown by the ExponentialRetryAlgorithm?" Not by ExponentialRetryAlgorithm itself, but by its subclass OperationTimedPollAlgorithm used for LRO polling, which throws a CancellationException when the cumulative timeout for all polls is exhausted.
With the current structure of RetryAlgorithm.createNextAttempt() the CancellationException-throwing codepath doesn't get reached if the final LRO polling RPC fails. That results in an ExecutionException getting propagated instead, which is misleading when that failure is triggered by an overly-truncated deadline. The mismatch of expectations is IMO likely at the root of the test flakes from #12373 and #3277.





The current RetryAlgorithm logic is susceptible to an edge case if a short RPC deadline is set near the end of polling. If an exception results from a too-short deadline, the exception thrown by the associated Future will be an ExecutionException rather than e.g. CancellationException (for LRO polling). This likely causes flakiness observed in #12373 and #3277.
This change ensures that shouldRetryBasedOnTiming executes if the operation hasn't completed successfully, even when shouldRetryBasedOnResult is false (e.g. due to non-retriable RPC exception due to the short deadline). This way a timeout-related exception will be thrown instead if the overall timeout has been reached.
Resolves #12373